Mean-field methods in evolutionary duplication-innovation-loss models for the 
genome-level repertoire of protein domains 



A. Angelini/ A. Amato/ G. Bianconi,^ B. Bassetti,^ and M. Cosentino Lagomarsino^ 

^ Universita degli Studi di Milano, Dip. Fisica. Via Celoria 16, 20133 Milano, Italy 
^Physics Department, Northeastern University, 111 Forsyth St., 
Ill Dana Research Center, Boston 02115, MA, USA 
(Dated: January 22, 2010) 

We present a combined mean-field and simulation approach to different models describing the dy- 
namics of classes formed by elements that can appear, disappear or copy themselves. These models, 
related to a paradigm duplication-innovation model known as Chinese Restaurant Process, are de- 
vised to reproduce the scaling behavior observed in the genome-wide repertoire of protein domains 
of all known species. In view of these data, we discuss the qualitative and quantitative differences 
of the alternative model formulations, focusing in particular on the roles of element loss and of the 
specificity of empirical domain classes. 

I. INTRODUCTION 

Understanding the complexity of genomes and the drives that shape them is a fundamental problem 
of contemporary biology [1], which poses a number of challenges to contemporary statistical mechanics. 
Considering this problem from a large-scale viewpoint, the basic observables to account for are the distri- 
butions of the different "functional components" (such as genes, introns, non-coding RNA, etc.) encoded 
by sequenced genomes of varying size. 

When these genome- wide data are parametrized by measures of "genome size" (such as the number of 
bases or the number of genes in a genome), there are important emerging "scaling laws" both for the classes 
of evolutionary related genes [2-A], the functional categories of genes [5, 6] and some non-coding parts of 
genomes [7]. These scaling laws are the signs of universal invariants in the processes and constraints that 
gave rise to the genomes as they can be observed today. A current challenge is the understanding of these 
laws using physical modeling concepts and the comparison of the models to the available whole-genome 
data. This effort can help disentangle neutral from selective effects [8, 9]. 

Here we consider the statistical features of the set of proteins expressed by a genome, or proteome. A 
convenient level of analysis is a description of the proteome in terms of structural protein domains [10]. 
Domains are modular "topologies", or sub-shapes, forming proteins [11]. A domain determines a set of po- 
tential biochemical or biophysical functions and interactions for a protein, such as binding to other proteins 
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or DNA and participation in well-defined classes of biochemical reactions [6, 10]. Despite the practically 
unlimited number of possible protein sequences, the repertoire of basic topologies for domains seems to be 
relatively small [12]. With a loose parallel, domains could be seen as an "alphabet" of basic elements of the 
protein universe. Understanding the usage of domains across organisms is as important and challenging as 
decoding an unknown language. 

The content of a genome is determined primarily by its evolutionary history, in which neutral processes 
and natural selection play interdependent roles. In particular, the coding parts of genomes evolve by some 
well-defined basic "moves": gene loss, gene duplication, horizontal gene transfer (the transfer of genetic 
material between unrelated species), and gene genesis (the de novo origin of genes). Since domains are 
modular evolutionary building blocks for proteins, they are coupled to the dynamics followed by genes. 
In particular, a new domain topology can emerge by genesis or horizontal transfer, and new domains of 
existing domain topologies can emerge by duplication or be lost. Finally, topologies can be completely lost 
by a genome if the last domain that carries them is lost (see Fig. 1). 

Large-scale data concerning structural domains are available from bioinformatic databases, and can be 
analyzed at the genome level. These coarse-grained data structures can be represented as sets of "domain 
classes" (the sets of all realizations of the same domain topology in proteins), populated by domain re- 
aUzations. In particular, much attention has been drawn by the intriguing discovery that the population 
of domain classes have power-law distributions [13-17]: the number F{j,n) of domain classes having j 
members follows the power-law ~ where the exponent z typically lies between 1 and 2. An inter- 

esting thread of modeling work ascribes the emergence of power-laws to a generic preferential-attachment 
principle due to gene duplication. Growth models are formulated as nonstationary, duplication-innovation 
models [13, 18, 19] and as stationary birth-death-innovation models [14, 20-22]. 

Intriguingly, as we have recently shown [23], the domain content of genomes also exhibits scaling laws 
as a function of the total number of domains n, indicating that even evolutionarily distant genomes show 
common trends where the relevant parameter is their size. 

(i) The number of domain classes (or distinct hits of the same domain) concentrates around a master 
curve F{n) that appears to be markedly subhnear with size, perhaps saturating. 

(ii) The fitted exponent of the power-law-Uke distribution F{j, n) of domain classes having j members, 
in a proteome of size n decreases with genome size. In other words, there is evidence for a cutoff 
that increases linearly with n. 

(iii) The occurrence of fold topologies across genomes is highly inhomogeneous - some domain super- 
families are found in all genomes, some rare, with a sigmoid-like drop between these two cate- 
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gories [35]. 

We recently reported the above collective trends, and showed how the scaling laws in the data could be 
reproduced using universal parameters with non-stationary duplication-innovation models. Our results in- 
dicate that the basic evolutionary moves themselves can determine the observed scaling behavior of domain 
content, a priori of more specific biological trends. 

This modeling approach, while similar in formulation to that of previous investigators [13] who did not 
consider these scaling laws, has important modifications, mostly related to the scaling with n of the relative 
probability of adding a domain belonging to a new class and duplicating an existing one. To reproduce the 
observed trends, newly added domain classes cannot be treated as independent random variables, but are 
conditioned by the preexisting proteome structure. 

In this paper, we give a detailed account of this modeling approach, considering different variants of 
dupUcation-innovation-loss models for protein domains, and relate them to available results in the mathe- 
matical and physical literature. In particular, we will focus on mean-field approaches for the models and 
comparison with direct simulation, and we will show how they can be generally used to obtain the main 
qualitative and quantitative trends. 

The first part of the paper is devoted to the minimal model formulation, which only includes duplication 
and innovation moves for domains, and relates to the so-called Chinese Restaurant Process (CRP) of the 
mathematical literature [24, 25]. We will review the main known results for this model, derive analytically 
solvable mean-field equations, and show how they compare to the available rigorous results and the finite 
size behavior. 

The rest of the paper is devoted to biologically motivated variants of the main model related to two 
main features: including the role of loss of domains, which is a frequently reported event, and breaking the 
exchange symmetry of domain classes, which is unrealistic, as specific protein domains perform different 
biological functions. For these variants, we will present mean-field and simulations results, and characterize 
their phenomenology in relation with empirical data. In particular, while in general for these variants the 
rigorous mathematical results existing for the CRP break down, we will show how the use of simple mean- 
field methods proves to be a robust tool for accessing the qualitative phenomenology. 



4 



n. GENERIC FEATURES OF THE MODEL 

The model represents a proteome through its repertoire of domains. Domains having the same topology 
are collected in domain classes (Fig. 1). Thus the relevant data structures are partitions of elements (do- 
mains) into classes. The basic observables considered are the following: n, the total number of domains, 
/(n), a random variable indicating the number of classes (distinct domain topologies) at size n, a random 
variable ki, the population of class i, ni, the size at birth of class i, and f{j,n): the number of domain 
classes having j members. We will generally indicate mean values by capitalized letters (e.g. F{n) is the 
mean value of /(n), Ki the mean value of ki, etc.). 

The model is conceived as a stochastic process based on the elementary moves available to a genome 
(Fig. 1) of adding and losing domains, associated to relative probabiUties: po, the probability to duphcate 
an old domain (modehng gene duplication), pN, the probability to add a new domain class with one member 
(which describes domain innovation, for example by horizontal transfer), and pL, a loss probabiUty (which 
we will initially disregard, and consider in a second step). Iteratively, either a domain is added or it is lost 
with the prescribed probabilities. 

An important feature of the duplication move is the (null) hypothesis that duplication of a domain has 
uniform probability along the genome, and thus it is more probable to pick a domain of a larger class. This 
is a common feature with previous models [13]. This hypothesis creates a "preferential attachment" [26] 
principle, stating the fact that dupUcation is more likely in a larger domain class, which, in this model as in 
previous ones, is responsible for the emergence of power-law distributions. 

In mathematical terms, if the dupUcation probability is split as the sum of per-class probabilities Pq, 
this hypothesis requires that Pq oc ki, where ki is the population of class i, i.e. the probability of finding a 
domain of a particular class and duplicating it is proportional to the number of members of that class. 

It is important to notice that in this model, while n can be used as an arbitrary measure of time, the ratio 
of the time- scales of duplication and innovation is not arbitrary, and is set by the ratio pn/po- In the model 
of Gerstein and coworkers, this is taken as a constant, as the innovation move considered to be statistically 
independent from the genome content. In particular, both probabilities are considered to be constant. This 
choice has two problems. First, it cannot give the observed sublinear scaling of F{n). Indeed, if the 
probability of adding a new domain is constant with n, so will be the rate of addition, implying that this 
quantity will increase on average linearly with genome size. Moreover, the same model gives power-law 
distributions for the classes with exponent larger than two, in contrast with most of the available data. 

Previous investigators did not consider the fact that genomes cluster around a common curve [23], and 
thought of each of them as coming from an independent stochastic process with different parameters [13]. 
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Coarse-grained view of a proteome: 
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FIG. 1: (Color online) Scheme of the generic features of the duplication-innovation-loss model. Top: The sequences 
of proteins coded by a genome can be broken down into domains, represented by colored boxes. All the boxes of 
the same color, representing domains with the same topology are collected in the same domain class. Bottom: basic 
moves. Elements of a class can be duplicated or lost, and new classes can be formed with prescribed probabilities. 

Furthermore, choosing constant implies that for larger genomes the influx of new domain families is 
heavily dominant on the flux of duplicated domains. 

As noted by Durrett and Schweinsberg [19], constant pN makes sense only if one thinks that new fold 
topologies emerge from an internal "nucleation-like" process with constant rate, rather than from an external 
flux. This process could be pictured as the genesis of new topologies from sequence mutation. Empirically, 
while genesis events are reported [27] and must occur, it is clear that domain topologies are very stable, and 
the exploration of sequence space is not free, but conditioned by a number of additional important factors, 
including chromosomal position and expression patterns of genes, and their role in biological networks [28]. 
Moreover, in prokaryotes, it is known that a large contribution to the innovation of coding genomes is 
provided by horizontal gene transfer [27], the exchange of genetic material between species, which can 
be reasonably represented in a model as an external flux, as opposed to the internal nucleation process 
representing genesis. 

For eukaryotes, horizontal gene transfer is less important, and there can be multiple relevant innovation 
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processes including exonization, loss of exons, alternative start sites changing the protein. We have not 
attempted to model the detailed processes leading to innovation, and because of their higher complexity in 
eukaryotes, we prefer to compare the model to the prokaryote data set alone. However, we can point out that 
in principle the same model has good agreement with the set of prokaryotes and eukaryotes together [23]. 
In eukaryotes, the change in number of classes with respect to size change is generally small, but seems to 
have a trend that "glues" quite well with prokaryotes (and in particular, innovation decreases with size). 

Motivated by the subhnear scaling of the number of domain classes, and taking into account in an 
effective way the role of processes that condition the addition of new domain topologies, we consider 
statistically dependent moves. On general grounds, if a genome is a complex system where sub-components 
interact in clusters and non-locally, domain topologies as well have to be coordinated with other parts of the 
system, so that it is reasonable that evolutionary moves are conditioned by what is already present, and that 
the actual number of domain topologies need not to be trivially an extensive quantity. 

The simplest way to implement this choice is to concentrate on the innovation process. Let us consider 
the indicator ^(n), taking value 1 if a new domain class is born at size n, and value otherwise. The 
number of classes at size n will be f{n) = Z^j<„ ^(j). If the random variables ^{n) are independent and 
identically distributed, i.e. p{^{n) = 1) = pn =constant, /(n) follows a Bernoulli distribution whose mean 
value is linear in n. Moreover, will be increasingly concentrated with increasing n on the deterministic 
value Pat- If vice versa the random variables ^{n) are statistically dependent or also simply not identically 
distributed, the mean value may not be hnear in n, and the concentration phenomenon may not occur. 
Both features, dependence and lack of concentration, are important. The former is necessary to obtain 
the observed sublinear behavior, the latter might create an intrinsic "diversity" in the genome ensemble, 
independently on the finite size of observed genomes (however, the currently available data are insufficient 
to estabhsh this empirically). 

in. SIMPLEST FORMULATION: CHINESE RESTAURANT PROCESS 

We investigate this process using analytical asymptotic equations and simulations. We start by consid- 
ering only growth moves, by duphcation and innovation, postponing the inclusion of domain loss in the 
model. We will see that the resulting model contains the basic qualitative phenomenology of the scahng 
laws and can thus be regarded as the paradigmatic case. One can arrive at the defining equations with dif- 
ferent arguments. A simple way is to assume that domain duplication is a rare event, described by a Poisson 
distribution with characteristic time r, during which there is a flux of external or new domain topologies 
|. Then pn = In this case the variables ^(n) are independent but not identically distributed. It is 
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immediate to verify that /(n) has mean value given by Z^"=i and thus grows as ~ ^logn. The same 
result can be obtained by thinking of domain addition as a dependent move, conditioned on n, /(n) or both. 

It is possible to consider different intermediate scenarios where the pool of old domain classes is in 
competition with the universe explorable by the new classes. The simplest scheme, which turns out to be 

quite general, can be obtained by choosing the conditional probability that a new class is bom (^(n+ 1) = 1) 
given the fact that /(n) = / at size n 

PN = —X- > (1) 

n + 

hence 

n~af 

PO = — —w- ' (2) 

where ^ > and a G [0, 1]. 

Considering per-class dupUcation probabihty, one can choose the following expression, that asymptoti- 
cally estabUshes the preferential attachment principle: 



i h-oi 

Po = ■ (3) 

Here, represents a characteristic number of domain classes needed for the preferential attachment principle 
to set in, and defines the behavior of /(n) for small n ( n ^ 0). a is the most important parameter, which 
sets the scaling of the duplication/innovation ratio. Intuitively, the smaller a, the more the growth of / is 
depressed with growing n, and since pat is asymptotically proportional to the class density //n it is harder 
to add a new domain class in a larger, or more heavily populated genome. As we will see, this implies 
Pn/po ^ as n — > oo, corresponding to an increasingly subdominant influx of new fold classes at larger 
sizes. This choice reproduces the subUnear behavior for the number of classes and the other scaling laws 
described in properties (i-iii). 

This kind of model has previously been explored in statistics under the name of Pitman- Yor, or Chinese 
Restaurant Process (CRP) [24, 25, 29, 30], where it is known as one of the paradigmatic processes that gen- 
erate partitions of elements into classes that are symmetric by swapping, or "exchangeable". This process 
is used in Bayesian inference and clustering problems [31]. In the Chinese restaurant (with table sharing) 
parallel, individual domains correspond to customers and tables are domain classes. A domain belonging 
to a given class is a customer sitting at the corresponding table. In a duphcation event, a new customer is 
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seated at a table with a preferential attachment principle, and in an innovation event, a new table is added. 



Mean-field theory for the CRP 



A simple mean-field treatment of the CRP allows to access its scaling behavior. Rigorous results for 
the probability distribution of the fold usage vector {ki, kp}, for /(n) = F, are in good agreement 
with mean-field predictions. It is important to note that for this stochastic process, the usual large-deviation 
theorems do not hold, so that large-n limit values of quantities such as do not converge to numbers, 
but rather to random variables [24]. Despite of this non-self-average property, it is possible to understand 
the scaling of the averages Ki and F (of ki and / respectively) at large n, writing simple "mean-field" 
equations, for continuous n. Note that rigorously the mean value Ki(n) is still a random variable, function 
of the (stochastic) birth time rii of class i. 

From the definition of the model, we obtain 



dnKi{n) = 



Ki{n) 



a 



n + e 



(4) 



dnF{n) 



aF{n) + 9 

n + e 



(5) 



These equations have to be solved with initial conditions Ki{ni) = 1, and F{1) = 1. Hence, for a 7^ 0, 
one has 



i^.(n) = (l-a)^ + e , 

Ui + U 



(6) 



and 



F{n) 



a 



{a + 6) 



n + e 



^ n 



(7) 



while, for a = 0, 



F{n) = e\og{n + e) ^ log(n) . 



(8) 



These results imply that the expected asymptotic scaling of F{n) is subUnear, in agreement with observation 
(i). 

The mean-field solution can be used to compute the asymptotic of P(j, n) = F{j, n)/F{n), following 
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the same line of reasoning used by Barabasi and Albert for the preferential attachment model [26]. This 
works as follows. From the solution, j > Ki{n) implies rii > n*, with n* = go that the 

cumulative distribution can be estimated by the ratio of the (average) number of domain classes bom before 
size n* and the number of classes bom before size n, P{Ki{n) > j) = F{n*)/ F{n). P{j,n) can be 
obtained by derivation of this function. For n, j — > oo, and j/n small, we find 

P{j,n) , (9) 

for a / 0, and 

P{j,n)^j , (10) 

for a = 0. The above formulas indicate that the average asymptotic behavior of the distribution of domain 
class populations is a power law with exponent between 1 and 2, in agreement with observation (ii). In 
contrast, the behavior of the model of Gerstein and coworkers [13, 18, 19] can be found in this framework 
by taking improperly a = 1, that is for constant p^, Po- It gives a linearly increasing F{n) and a power- 
law distribution with asymptotic exponent 2 for the domain classes. Note that the phenomenology of the 
Barabasi- Albert preferential attachment scheme [26] is reproduced by a CRP-like model where at each step 
a new domain class (corresponding to the new network node) with on average m members (the edges of the 
node) is introduced, and at the same time m domains are dupUcated (the edges connecting old nodes to the 
newly introduced one). 

It is possible to obtain the same results through a different route compared to the above reasoning, by 
writing the hierarchy of mean-field equations for F{j, n), using a master equation-like approach. Similarly 
as what happens for the Zero-Range process [32], these equations contain source and sink terms goveming 
the population dynamics of classes. Duplications create a flux from classes with j — 1 to classes with j 
members, while only F(l, n) has a source term coming from the innovation move: 

dnF{n)-- 
dnF{l,n) 

< 

dnF{2,n) 

We consider the hmit of large n, and use the ansatz F{j, n) = XjF{n). This ansatz can be justified 



_ a F{n) +0 



n+e 



a F{n) +6) 
n+6 

-(1 



a 



n+e 



(1 



«; n+e ^) n+e 



(11) 
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FIG. 2: (Color online) The classes with j members form a finite constant fraction of the total number of classes. 
F{j, n) (the number of classes with j members) is plotted versus F{n) for different values of j in empirical data (a) 
and simulations of the CRP (a = 0.32, 9 — 50) without (b) and with domain loss (6 — 0.2) in model with uniform 
loss (c) and weighted loss (d). In the plots from simulations, the lines in lighter colors represent 10 different realiza- 
tions, while the points with error bars are the mean value over them. The figure shows that adding domain loss to the 
model has no qualitative consequences in the behavior of the domain class histograms. 

empirically and by simulations, as shown in Fig. 2, which compares this feature in empirical data and in 
simulations of different variants of the model. 
We have 



dnF{n) 

axi = 
ax2 = 



a F{n) 
n 

a - {I- a)xi 
(1 - q)xi - (2 - a)x2 



(12) 



aXj = (j - 1 - a)Xj-i - (j - a)xj 
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giving 



2 X2 = (1 - tt) Xi 
. J Xi = (i - 1 - a) Xj- 



(13) 



The solution of these equations is 



X. "^r(j + i)" r(i-a) r(j + i) 



(14) 



which can be estimated as 



1 

Xj = a 



l+a 



(15) 



r(i-a) 

giving the result for the scahng of n) and its prefactor 



IV. DIRECT SIMULATION OF THE CRP AND FINITE-SIZE EFFECTS 

Going beyond scaling, the probability distributions generated by a CRP contain large finite-size effects 
that are relevant for the experimental genome sizes. In this section we analyze the finite size effect affecting 
the distribution over the domain classes, obtained performing direct numerical simulations of different CRP 
realizations. The simulations allow to measure F{n), and F( j, n) for finite sizes, and in particular for values 
of n that are comparable to those of observed genomes shown in Fig. 3. 

The normalized distribution of the number of classes with j domains over a genome of length n reaches 
the theoretical distribution suggested by our model only in the asymptotic Umit 

F{n) ~^ ^"^-^^ " r(i - a)r{j + 1) • ^'^^ 

It is possible to obtain more information by studying the ratio of the asymptotic distribution and the distri- 
bution obtained from the CRP simulation as shown in Fig. 4(a). The plot shows that there is a value m* (n), 
depending on the size n of the genome, beyond which the distribution is not anymore consistent with Pa{j) 
but shows exponential decay and large fluctuations. 

In order to obtain a quantitative estimate of the deviation from scaling generating the cutoff, we define 
an order parameter as follows. Using data obtained from the simulation as in Fig. 4(a), we find the mean of 
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the first 30 points of tlie plotted function and then compute the standard deviation a by analyzing windows 
of K points together. The cutoff is defined when a > 5, where (5 is a parameter. The result of this procedure 
can be seen in Fig. 4(b). The cutoff shows a linear dependence from the genome length n. To make sure 
the procedure does not depend too much on the number of iterations Nj used to obtain the mean value of 
F{j, n), we performed it for different values of Nj. As can be expected (Fig.4(c)), more statistics is needed 
for probing the cutoff trend in those regions where the probability density function is very small. Were it 
necessary to obtain from the distribution over domains n an estimate of parameters for the underlying CRP, 
one could decide to consider only data with j < jcutoff- At the scales that are relevant for empirical data, 
finite-size corrections are substantial. Indeed, the asymptotic behavior is typically reached for sizes of the 
order of n ~ 10^, where the predictions of the mean-field theory are confirmed. Comparing the histogram 
of domain occurrence for the mean-field solution of the model, simulations and data, it becomes evident 
that the intrinsic cutoff set by n causes the observed drift in the fitted exponent of the empirical distribution 
visible in Fig. 3. This means that the common behavior of the slopes followed by the population of domain 
classes for genomes of similar sizes can be ascribed to finite-size effects of a common underlying stochastic 
process. 
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FIG. 3: (Color online) (a) Cumulative distributions for prokaryotes (data from the SUPERFAMILY database). Each 
is the cumulative histogram of domains in classes for a single genome, (b) Fitted exponent of F{j, n) from empirical 
data of all prokaryotes, plotted as a function of n. 

Beyond the linear cutoff, the behavior of the distribution becomes realization-dependent due to the 
breaking of self-average [33]. The relevant parameter to disentangle the realization-dependence is /. High- 
/ realizations have different tails of the distribution from low-/ ones, giving rise to the large fluctuations 
observed in Fig. 4(a). Thus, while the mean-field approach is successful in predicting the asymptotic scaling 
of the distribution F{j, n), it does not capture the finite-size effects which can be observed in single real- 
ization of the CRP process with finite n and /. Beyond mean-field is possible to obtain more information 
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FIG. 4: (Color online) (a) Ratio of finite-size and asymptotic value for the distribution of domain classes population 
F{j, n) taken from 500 realizations of a CRP with a — 0.32 and 9 = 50. The three plots corresponds to different 
value of n. (b) Linear n-dependency for the cutoff in a CRP. The parameters in the plot are a = 0.1 and 9 = 200. A 
similar trend can be observed for other values of the parameters, (c) Cutoff trends obtained from a CRP with a = 0.44 
and 9 — 60 and a different number of iterations. The procedure has parameters K = 10 and S = 0.5. 



by considering the sum of all CRP trajectories conditioned to reaching configurations with given n and /. 

This enables a statistical-mechanical derivation of the normalized distribution of the number of domains 
with / classes over a genome of length (number of domains) n. Since the focus here is on the mean-field 
approach, the calculation is described in a parallel work [33]. 



V. OCCURRENCE OF DOMAIN CLASSES AND CRP REALIZATIONS 

The above model does not describe evolutionary time in generations. Conversely, it reproduces random 
ensembles of different genomes generated one from the other with the basic moves of duplication, innova- 
tion (and loss, see below). It considers only events that are observed at a given n, independently on when or 
why they happened in physical or biological time. Genomes from the same realization can be thought of as 
a trivial tree of life, where each value of n gives a new specie. In the case including domain deletions, more 
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genomes of the same history can have the same size. In contrast, independent realizations are completely 
unrelated. 

The scaling laws in F{n) and F{j, n) hold for the typical realization, indicating that the scahng laws 
originate from the basic evolutionary moves and not from the fact that the species stem from a common 
tree with intertwined paths due to common evolutionary history. For example, two completely unrelated 
realizations will reach similar values of F at the same value of n. 

The data confirm this fact: phylogenetically distant bacteria with similar sizes have very similar number 
and population distribution of domain classes (see Fig. 8). 

While the scaling laws are found independently on the realization of the Chinese restaurant model, the 
uneven occurrence of domain classes can be seen as strongly dependent on common evolutionary history. 
Averaging over independent realizations, the prediction of the CRP is that the frequency of occurrence of 
any domain class would be equal, as no class is assigned a specific label. In the Chinese restaurant metaphor, 
the customers only choose the tables on the basis of their population, and all the tables are equal for any 
other feature. 

In order to capture this behavior with the model, one can consider the statistics of domain topology 
occurrence of a single realization, which is an extremely crude, but comparatively more realistic description 
of common ancestry. In other words, in this case, the classes that appear first are obviously more common 
among the genomes, and the qualitative phenomenology is restored, without the need of any adjustment in 
the model definition (Fig. 5). 
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Domain Class 

FIG. 5: (Color online) Single realization averages reproduce qualitatively empirical topology occurrence (the fraction 
of genomes where the given domain topology is found, normalized by the total number of genomes). Domain topology 
ranked occurrence in typical realizations of the CRP with different parameters (symbols), compared to the empirical 
data for bacteria (continuous line). The simulated data were obtained considering the occurrence curve for 500 
realizations of the process for all sizes from n — 300 to n = 8000 ("uniform"), i.e. the range of sizes observed in the 
data, or directly for the set of empirical sizes of the genomes ("sampled"). 
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VL DOMAIN LOSS 



Loss of genes, and thus of domains, is reported to occur frequently in genomes. We will discuss now 
variants of the model considering the introduction of a domain deletion, or loss rate. The question we ask is 
whether the introduction of domain loss, which we consider mainly as a perturbation, affects the quaUtative 
behavior of the model, for example by generating different scaling behavior or phase transitions. We will see 
that the answer to all these questions is mostly negative even for non-infinitesimal perturbations, provided 
the loss rate is constant and does not scale itself with F and n. The main exception to this behavior is found 
when the loss probability of a domain depends on its own class size. 

We introduce domain loss through a new parameter 5, which defines a loss probabihty pl in two a 
priori different ways. (1) We can distribute this probabihty equally among domains, so that the per-class 
loss probability is p\ = 5^. Consequently, the duplication and innovation probabihties po and pat are 
rescaled by a factor [1 — 5). (2) We can also weigh the loss probability of a domain on its own class size, 
in the same way as domains are duplicated in the standard CRP, so we obtain a per-class loss move with 
probability p\ = ^^jfpf , giving a total p^ = ^ ""n+e rescaling of po and pN- We will see that model 

(1) and (2) are not equivalent. 

On technical grounds, the introduction of domain loss makes the stochastic process entirely different: 
n is now a random variable, and all the observables that depend on it (e.g. F{n)) are stochastic functions 
of this variable. Another parameter, t, describes the iterations of the model. Operatively, we tackle the 
two models with the usual mean-field approach, writing equations for n and F of the kind dtn = Q{n, F), 
dtF = R(n, F), and hence obtain the behavior as a function of n by considering dnF = Q^^^'^j . The exact 
meaning of these equations is not straightforward. For example F{n) should represent the average on all 
histories passing by n, but the differential equation strictly describes only the dependence of the observable 
from the actual value of the random variable n. Nevertheless, the predictions of this mean-field approach 
agree well with the results of simulations, indicating that these complications typically do not affect the 
behavior of the means. We will consider situations where, on average, genomes are not shrinking. 

Considering model (1), we can write the mean-field equations as 



dtF{t) = {l-6) 



aF{t) + e 
n + 9 




n 



(17) 



dtm) = {i-5) 



Ki{t) - a 

n + e 



n 



(18) 
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where the sink term for F derives from domain loss in classes with a single element, quantified hy F{l,n). 
Since time does not count genome size, one has to consider the evolution of n with time t, given in this case 
by dtn = 1 — 25. In order to solve these expressions, we use the ansatz F{1, n) = xiF{n), and considering 
the limit in large n. The ansatz is verified by simulations and holds also for empirical data, as previously 
shown (Fig. 2). The first equation reads: 



The above equation gives the conventional scaling for F{n) and K{n) with a replaced by or = 
iki^}^^^2<A^ correction resulting from the measured value of xi. By the use of computer simulations 
we notice that the xi coefficient tends to a for infinite-size genomes (Fig. 6(a), 6(b), and 6(c)), so that 
the asymptotic trend of the equally distributed domain loss is identical to that of the standard CRP. This 
behavior is independent from the chosen 6 < 1/2, as the asymptotic regime depends only on the growth of 
F{n), governed by a. 

When we analyze the domain distribution of finite-size genomes, we obtain the conventional results: the 
power-law depends on the genome size, but not on the value of 5 (Fig. 6(b) and 6(d)). This is explained 
considering the fact we are comparing runs with fixed genome size, thus with different number of moves, 
and we do not consider genomes that lose all their own domains. In fact biologically one cannot trace the 
number of moves needed to reach a specific genome, but essentially we can observe only genomes in their 
actual state. 

More precise results can be obtained by the use of the mean field "master equation" approach sketched 
above. Using the same ansatz F{j, n) = XjF{n), we obtain the following hierarchy of equations for Xj 



with B = {1 — 6)a — dxi ■ It is possible to estimate the solution of this system by taking a continuum limit 
as 



dnFjn) ^ 1 Ul-S)a-6xi 
F{n) n[ 1-2(5 



(19) 



[B + 5j + (1 - 5){j - a)]xj = (1 - - 1 - a)xj-i + 5{j + , 



(20) 



Bx{x) + (1 - 5)d^[{x - a)x{x)] - 5d^x{x) = , 



(21) 



which can be solved giving 



X{x) 



"11^+^ 



(22) 



X 
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(c) (d) 

FIG. 6: (Color online) A model with uniform domain loss (model (1)) does not lead to a change of qualitative behavior 
with respect to absence of domain loss. We estimated xi from simulations with a linear fit of the measured (linear) 
F{l,n). The measured coefficient xi (a) approaches with the parameter a, hence the observed scaling exponent 
aji, measured from the simulated F{n) agrees very well with the imposed a. (b) This trend is weaker for smaller a 
but can be regarded as a finite size effect, as the agreement improves (c) for increasing n. The plot in (c) refers to 
9 = 75,6 = 0.35. Finally (d), the same phenomenology holds for a = 0, where the observed parameter On extracted 
from the behavior of F{n) in simulations agrees well with 6. 

with Z = j32s- ^l^o ^^'^ F{n) = , which is consistent with the constraint Yl,j JXj = pj^ since 
J2j jXj ~ n^^^ . It is then clear that the introduction of domain loss is equivalent to a rescaling of the 
parameter a to = Z{a, 6), but in our case, Z{a, 6) = a asymptotically. 

The case a = has to be treated separately, but the behavior is similar (Fig. 6(d)). 

A similar procedure is applicable to model (2). In this case, however, the dependence of the effective 
death rate from n can bring to an interesting change in the phenomenology, where 6 can select the observed 
exponent aji and also determine a regime of linear growth for F{n). 

To understand this point we can consider the mean-field evolution of F{n). This is determined asymp- 
totically by the balance of a growth term p]\f ~ aF/n and a loss term pi ~ 6F{l,n){l — a)/n. With the 
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FIG. 7: (Color online) In a model with weighted domain loss (model (2)), the loss probability can trigger a transition 
between two different scaling behaviors for F{n). (a) behavior of observed exponent versus a in simulations. The 
exponent saturates to unity at a critical value that is lower than one. The dashed lines are mean-field predictions in 
the ansatz of F{n) sublinear in n. (b) "Phase-diagram" for the behavior of an for different values of a and 5. The 
dashed line is a mean-field estimate of the transition point io cur. = 1, while circles and diamonds are the an — .9 
and ur — .995 lines from simulations at n = 10^. 



usual ansatz this gives an asymptotic evolution equation of the kind 

a — 6{1 — a)xi 



n) 



F(n) 



[l-25)n + 26aF{n) 



(23) 



with the usual definitions. Simulations confirm the ansatz F{l,n) = xiF(n) for this second model, which 
we will use in the mean-field reasoning. If F{n) is intensive, i.e. asymptotically of order inferior to n, the 
term 26aF can be neglected and this equation gives the scaUng law F ~ n"^, with aji = ^ 
However, this equation has solution also if F is order n, i.e. an = 1, and the term 26aF in Eq. (23) cannot 
be neglected. In this case a and 6 determine the prefactor of the scaling law F ^ n. 

Thus, there are two self-consistent mean-field asymptotic solutions, and we expect a transition between 
the two distinct behaviors. The existence of this transition is confirmed by simulations (Fig. 7(a)): at fixed 
5, Or saturates to or ~ 1 for larger values of this parameter. The transition point can be understood in 
mean-field as the intersection of the two solutions or = 1 and aji = ""'^i^^2^~"^ varying 5, and gives 
rise to a two-parameter "phase-diagram" separating the linear from the sublinear scaling of F{n) with n, as 
shown in Fig. 7(b). 

In conclusion, the mean-field approach is effective in exploring the effects of domain loss, which, under 
some general hypotheses, does not disrupt the basic phenomenology of the duplication-innovation model. 
Specifically, there appear to be no qualitative changes introduced by a finite uniform loss rate, as long as this 
rate is constant with n. A loss rate that is weighted as the innovation rate, instead, can induce an interesting 
transition from sublinear to linear scaling. 
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Thinking about the empirical system, no direct quantitative estimates are currently available regarding 
the domain loss rate as a function of genome size or number of classes. For this reason, it currently appears 
difficult to make a definite choice for this ingredient in the model. 

VII. MODELS WITH DOMAIN CLASS SPECIFICITY 

In the previous sections we analyzed models that make no distinction between domain topologies, but 
the latter are selected for duplication moves only on the basis of their population. 

It is then clear that they can reproduce the observed qualitative trends for the domain classes and their 
distributions with one common set of parameters for all genomes. One further question is to estimate the 
quantitative values of these parameters for the data. While the empirical slope of F{n) could be seen as 
more compatible with a model having a = 0, as its slope decays faster than a power law for large values of 
n, the slopes of the power-law distribution of domain classes P{j, n) and their cutoff as a function of n is 
in closer agreement to a CRP with a between 0.5 and 0.7 [23]. 

We will now discuss a CRP variant that is able to distinguish domain classes based on a priori informa- 
tion, thus breaking the symmetry of the model by exchange of tables. 

This is equivalent to the introduction of differently colored "tablecloths" labeling the tables, that are 
determinant in the choice for one table or the other. 

Those table colors can be set by any observable of interest. In our analysis, we considered the empirical 
occurrence of a domain topology as a label. Indeed, the occurrence of a given domain class is determined by 
its biological function. For example, as expected, all the "core" biological functions such as translation of 
proteins and DNA replication are performed by highly occurring domain topologies, since this machinery 
must be present in each genome. Accordingly, these universal classes performing core functions have to 
appear preferentially earlier on in a model reahzation. This variant of the model is important for producing 
informed null models for the analysis of the empirical data, and, as we will see, shows the best agreement 
with respect to the scaling laws. 

We will introduce the variant with class specificity by coupling the CRP model to a simple genetic 
algorithm able to select between innovation moves that choose different classes. Let us first introduce some 
notation to parametrize domain class occurrence. We define the matrices af where: 



1 domain i found in genome o 

(24) 

— 1 domain i not found in genome g 



20 



100 



100 - 



100 



n~1600 



bl, Ij, Is, ni, pg, pp, 
sd, sp, tg, ts, x1 




VISOO o I oo 

I N] I 



n~1800 



a9, c5, Ix, 
s4, s5, sz 




n~3921 



o ec, ew, rr 




J I I iKi I I I 



n~1500 

o Iv, oc, pO, x2 




(mco o . oo 
I ^ I 



n~1900 



cd, ch, jk, 
so, ti, tr, ws 




lOOBD . OOQE I 

I 



n~4050 



bO, bx, rx, 
vb, vp, x5 




10 



10 



J 



FIG. 8: (Color online) Comparison of F{j, n) from the SUPERFAMILY Database with mean-field predictions and 
simulations of the model with class specificity. Comparison between n— dependence of powerlaw fit (lines in blue) 
and universality of the two parameters from our model. Two letters identify each genome, whose full name can be 
found in Appendix (Table B. 1). Simulations (lines in gray) from our model use the same values a = 0.75 and 6 = 32. 



It is possible to consider the mean taken along the matrix columns, 



9 



where the label "emp" means that this value is obtained from empirical data (Fig. 5). 

Generically, a genetic algorithm requires a representation of the space of solution Q and a function /(w) 
that tests the quality of the solution computed. In our case, the former is simply the genome obtained from 
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the a CRP step, parametrized by erf. The latter is defined as 

m = n ( (-rO) = exp f E < >) • (26) 

i \ i / 

The value of the above scoring function taken over simulated genomes measures how much the set of 
domain classes they possess agrees with experimental data (in this case on occurrence) and enables to 
compare different "virtual" CRP moves. 

We consider the case of two virtual genomes g' and g" generated through standard CRP steps, for 
simplicity without domain loss, from an initial size n and genome g{n). 

Note that in this variant, since the empirical domain topologies are a finite set, domain classes are also 
finite. As a consequence, tables with a given tablecloth are extracted without replacement, affecting the 
pool of available colors. As we will see, this is an important requirement to obtain agreement with the data, 
as it determines the saturation of the function F{n) also for large values of a. We will discuss the role of 
an infinite pool in the following subsection. 

Also, as anticipated, domain classes have different "color", or in mathematical terms the exchangeability 
of the process is lost. Classes are drawn from the set of the residual ones with uniform probability. Genomes 
g' and g" are compared through the function and the highest score one will be the genome g{n + 1) so 
that5(n + 1) = argmax(J"(£?'), 

In these conditions, the rigorous results present in the literature for the CRP cease to be valid. It is still 
possible, however, to analyze the behavior of this variant by the mean-field approach adopted here, and to 
compare with simulations. Since the selection rule chooses strictly the maximum, it is essentially able to 
distinguish the sign of (crf^^) only. For this reason, it is sufficient to account for the positivity (which we 
label by "+") and negativity ("-") of this function for a given domain index i. This means that, with the 
simplification of two virtual moves only, the model introduces only one extra effective parameter, i.e. the 
ratio of the "universal" (positive J^) to the "contextual" (negative domain classes. 

In order to write the mean-field equations for this model variant, we first have to classify all the possible 
outcomes of the virtual CRP moves. The genomes g' and g" proposed by the CRP proliferation step can 
have the same (labeled by "1"), higher ("1+") or lower ("1_") score than their parent, depending on po, 
Pn and by the probabihties to draw a universal or contextual domain family, and p- respectively. Using 
these labels, the scheme of the possible states and their outcome in the selection step is given in Table 1. 

From the Table 1, it is straightforward to derive the modified duplication and innovation probabilities po 
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proliferation [g ,g ) 


1 1 '1 • J- 

probability 


selection 


(1,1) 


Po 


old 


(1,1-) 


2 Po PN 


old 


(1,1+) 


2 Po PN P+ 


new+ 


(1+,1+) 


pn pi 


new+ 


(1+,1-) 


2 p\j p_ p+ 


new+ 


(1-:1-) 




new- 



TABLE I: Compound probabilities of picking up a new domain or an old domain with positive or negative cost 
function. 



and Pat of the complete algorithmic iteration: 

PO =PO {PO + ^PNP- ) 



(27) 



PN = PN {PN + 2 POP+) = PN+ + PN- , 



(28) 



where pn+ = PnP+C^ — PnP+) and pAr_ = — are the probabilities that the new domain is 
drawn from the universal or contextual families respectively. 

We now write the macroscopic evolution equation for the number of domain families by the usual pro- 
cedure. Calling K'^{n) and K~(n) the number of domain classes that have positive or negative (crf™^) 
and are not represented in g{n). 



dnF{n) = PN 
dnK+{n) = -PN+ ■ 
^ dnK~{n) = -pN- 

Now, p+ = K+/{K- + K+) = K+/{D - F{n)), so that we can rewrite 



(29) 



dnK+{n) = -( 
dnK-{n) = 



dnF{n)= C^^) 

aF{n)+e 

n+e 



ceF(n)+e 

n+e 

X K+{t) 
) D-F{n) 



2K+{n) f n-aF{n) \ 
D-F[n) \ n+e > 
( aF{n)+e s K+{n) 
V n+e ' D-F[n) 



( aF{n)+e ^2 ( J<Mn)_\2 
V n+e > ^D-F{n)' 



(30) 



The above equations have the following consistency properties 
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• dn[K+ + K- + f') =0, hence K+ + R- + F = D Vn. 



• dnF < 1> hence F{n) < n. 

• dnF > 0, dnK+ > and dn{F + K+) > so that F grows faster than K+ decreases. 

Choosing the initial conditions from empirical data no, F{no), size and number of domain classes of the 
smallest genome, we have, since F{no) < uq and a < 1, 



It is simple to verify that under this condition the system always has solutions that relax to a finite value 
Foo < D. Indeed, after the time n* where iC+(n*) = 0, the equations reduce to dnK^ = 0, K~ = D — F 
and 



immediately giving our result. It is important to notice that this result depends on the fact the empirical 
reservoir of domains is finite (and thus on the biological hypothesis that the full pool of available domain 
topologies is). Indeed, in presence of an infinite reservoir of domain classes, F{n) does not relax to a finite 
value. In this case, similarly as in the case of non uniform loss, there is a transition from linear to subhnear 
scahng piloted by the parameters p+ and a. If 2pj^a <1,F^ •n?P+'^, while if 2p+Q; > \, F ^ n. 

Numerical solutions of Eq. (30) give the same behavior for F{n) as the direct simulations (Fig. 9). In 
particular, while this function grows as a power law for small genome sizes, it saturates at the relevant 
scale, giving good agreement with the data (Fig. 10). The internal laws of domain usage of this model 
were obtained from direct simulations only, and give a good quantitative agreement with the data (Fig. 10). 
Finally, Fig. 9 also shows that, for large values of a (above 0.7) this function reaches a maximum at sizes 
between 2000 and 4000. This is also where most of the empirical genomes are found, indicating that this 
range of genome sizes may allow the optimal usage of universal and contextual domain families. 



We presented from a statistical physics viewpoint a class of duplication-innovation-loss stochastic pro- 
cesses able to describe the probability distributions and scaling laws observed in genomic data sets for 
protein structural domains with few effective parameters. These models are different declinations of the ba- 
sic paradigm set by the Chinese restaurant process which, though much explored in the statistical Uterature, 



aF(no) + 9 
riQ + e 



< 1 . 
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FIG. 9: (Color online) Numerical solutions of the mean-field equations of the CRP model with selection of specific 
domain classes. Left panel: score function J-(n) for different values of a. Right panel: F[n) plotted in linear and 
logarithmic (inset) scales. 
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FIG. 10: (Color online) Comparison of the CRP with class specificity with experimental data. CRP parameters are 
a — 0.75 e = 0.32. Number of domain classes versus the length n of genome. Simulations were performed for 
n < 10000 taking data at each interval of 1000 units in size, and considering 5 realizations at each step. 

remains relatively unexplored by the physics community, despite of its rich and peculiar phenomenology, 
which could make it useful in multiple applications. 

Our focus has been to present the basic phenomenology of different variants of the model connected to 
possibly relevant aspects of the evolutionary dynamics of protein domains, prominently domain duplication, 
innovation, loss, and the specificity of domain classes. In doing this, we have shown how a mean-field 
approach, despite of its simplicity, can be extremely powerful in the analysis of the qualitative behavior of 
this class of models. More subtle aspects of these models might be approached by simulations and refined 
statistical mechanics methods, directly accessing the sum on all different paths. 

For the standard CRP, we have shown how the scaling laws in the number and in the slope of the ob- 
served power-law distributions of domain classes are qualitatively reproduced, by the typical or the average 
realization. The salient ingredient for this feature is that this model can include the correct relative scaling 
of innovation to duplication moves, which can be generated by statistical dependence of innovation, or by 
the fact that the process of domain birth does not give rise to identically distributed random variables at 
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different sizes. 

The mean-field results can be obtained in two independent ways: with dynamic equations for the pop- 
ulation and number of domain classes, or, similarly as for the Zero-range process [32], by considering the 
evolution of the number of classes with a given population. The latter "master equation" approach gives 
access to the full histogram of domain class population. The scaling of the slope of the domain class 
distributions can be understood as a finite-size effect on these histograms. We have also shown how the 
uneven occurrence of domain classes can be explained by the CRP provided one considers the statistics of 
occurrence in a single reaUzation. This can be interpreted as the fact that genomes are related by common 
evolutionary paths. 

This phenomenology is extremely robust to the introduction of finite domain loss probabilities that do 
not scale with n or F. For uniform domain loss, the full mean-field equations are still accessible analytically, 
and point to the effect of domain loss as a simple effective rescaling of the model parameters. The presence 
of a loss rate that scales with size, instead, can trigger a transition from sublinear to linear scaling in the 
number of distinct classes. 

Finally, we have discussed the case of models that introduce the specificity of domain classes, and thus 
explicitly breaks the symmetry between them, as expected for the biological case. Such variants can be very 
useful in more detailed statistical investigations (e.g. as a null model) and to define inference problems on 
the available bioinformatic data. 

For example, a preferential duplication of conserved proteins has been reported in eukaryotes [34], and 
could be tested against a duplication-innovation-loss model where a "conservation index" characterizes 
specificity. 

Here, we have shown how a CRP variant with specificity can be formulated as a genetic algorithm where 
different CRP virtual moves are selected on the basis of an informed scoring function, defined on the basis 
of further empirical observables related to domain classes (such as function, correlated occurrence, etc.) 
In our case, we have shown how such a variant, weighting the virtual moves according to the observed 
occurrence of the finite pool of domain classes has very good quantitative agreement with the available 
data. 

The future perspective on this systems and models are abundant, both for the application of the models to 
the genomics of protein domains, where the most promising ways seems to be the use in specific inference 
problems on species with known evolutionary histories, and in other problems of statistical physics and 
complex systems, for example to develop new growth models for complex networks where the nodes can 
evolve with similar moves. One prominent example is the case of protein-protein interaction networks, 
graphs where the nodes follow a dynamics that is coupled to those of protein domains, while the edges can 
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be inherited by duplication, lost, or rewired by mutation and natural selection. 
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Appendix A: Data and methods 

All the data considered here were extracted from the SUPERFAMILY database version 1.69 and 1.73. 
We considered domains at the superfamily level, although similar results hold for folds and families. A 
genome "size" n was defined as the number of domain hits for each genome. The main data structures 
considered were the number of domain classes and the histograms of domain class population. We also 
considered the ranked occurrence of a given domain class across genomes. 

The different variants of the duplication-innovation-loss model were simulated directly using a C-i~i- code 
and compared with the mean-field results and the empirical data. The variant with class specificity couples 
the CRP model to a simple genetic algorithm selecting innovation moves that choose distinct classes, on the 
basis of their empirical occurrence. For all the variants, we derived and solved the mean-field equations for 
the evolution of the main observables, and described alternative approaches for their solution. 

The finite-size behavior of a CRP was studied comparing the finite-size histograms obtained from direct 
simulations with the asymptotic limit of the exact analytical result (formula 16) [24], and defining a cutoff 
as the class-size where the deviation was larger than an arbitrary threshold. The cutoff was studied as a 
function of n. For the CRP variants with specificity and domain loss only the mean-field solution given here 
is available for the same analysis. The exponent of the empirical domain class distributions for genomes 
with different size was estimated from a power-law fit of the cumulative histograms, keeping into account 
the finite-size cutoff. 
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Appendix B: List of Bacteria Used in the Analysis of Fig. 8 



Abbrevistion 


Full Name 


a9 


Streptococcus agalactiae A9 


bU 


Bacillus anthracis Sterne 


Dl 


Baumannia cicadellinicola He 


Dl 


DinaoDactenuni longutn rsucz/Uj 


DX 


iDacillus cereus AlL-L. lUVo/ 


CJ 


Chlorobium chlorochromatii CaD 


CQ 


coryneDactenuni aipntneriae iNCiL. ijizy 


cn 


CnloroDium tepiuum iLo 


ec 


Escherichia coh K12 


ew 


Escherichia coh W^3 110 


iir 


Corynebacterium jeikeiuiti K411 


y 


Lactobacillus johnsonii NCC 533 


It- 
IS 


Lactobacillus sakei ssp. sakei 23K 


iv 


Lactobacillus salivarius ssp. salivarius UCC118 


Ix 


i^eiisonia xyli ssp. xyli v_ 1 L.r>U / 


ni 


Neisseria gonorrhoeae FA 1090 


oc 


Prochlorococcus marinus NATL2A 


pu 


Candidatus Protochlamydia amoebophila UWE25 


Pg 


Porphyromonas gingivalis W83 


PP 


Streptococcus pyogenes MGAS2096 


IT 


Rhodopseudomonas palustris BisB5 


rx 


Rhodoferax ferrireducens Til 8 


c/l 


Streptococcus agalactiae 2603V/R 


sj 


Streptococcus agalactiae NEM316 


scl 


Streptococcus thermophilus LMG 18311 


so 


SjTiechococcus sp. CC9605 


sp 


Streptococcus pyogenes MGAS 10394 


sz 


Synechococcus sp. CC9902 


tg 


Streptococcus pyogenes MGAS 10750 


ti 


Thiomicrospira denitrificans ATCC 33889 


tr 


Thermus thermophilus HB8 


ts 


Streptococcus pyogenes MGAS 10270 


vb 


Vibrio vulnificus YJ016 


vp 


Vibrio parahaemolyticus RIMD 2210633 


ws 


Wolinella succinogenes DSM 1740 


xl 


Streptococcus thermophilus CNRZ1066 


x2 


Streptococcus pyogenes Ml GAS 


x5 


Bacillus cereus E33L 



TABLE B.l: Abbreviations used for the genomes in Fig. 8. 



